We have all seen the rise in the number of forest fires in recent years, and scientists and climate activists worldwide have raised their concerns. While fire is a natural process in some ecosystems, these recent forest fires are rooted in the global rise in temperature and poor land management by authorities. As a result, whether a wildfire originates from human activity or natural causes, the drier climate makes it easy for a fire to spread intensely over a region.
These fires are a clear call for change because they can cause irreparable damage to forest ecosystems. The smoke and gases released by these fires travel to nearby cities, causing health crises. While governments worldwide introduce initiatives to promote sustainable practices and the general public abides by them, it is also vital for scientists to know which forests are prone to fires. Through this project, we intend to answer the question: "Can we predict a wildfire in these Algerian forests based on the given data?"
Algerian forest fires: https://archive.ics.uci.edu/ml/machine-learning-databases/00547/Algerian_forest_fires_dataset_UPDATE.csv
This data set has 243 observations and fourteen variables: thirteen numerical and one categorical.
# install.packages("skimr")
# install.packages("GGally")
# install.packages("tidyverse")
# install.packages("repr")
# install.packages("tidymodels")
# install.packages("kknn")
library(tidyverse)
library(repr)
library(GGally)
library(tidymodels)
library(skimr)
# URL that contains the dataframe that we want to examine
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00547/Algerian_forest_fires_dataset_UPDATE.csv"
# Reading dataframe into R
algerian_forest_fires <- read_csv(url, skip = 1)
# Examining the data frame
# algerian_forest_fires
# Inspecting the parsing failures
# algerian_forest_fires[123, 1:14] -- Empty row between two data sets
# algerian_forest_fires[168, 1:14] -- Needs to be fixed
algerian_forest_fires[168, 10] = '14.6'
algerian_forest_fires[168, 11] = '9'
algerian_forest_fires[168, 12] = '12.5'
algerian_forest_fires[168, 13] = '10.4'
algerian_forest_fires[168, 14] = "fire"
To extract the target data frame, we will split the file into two data frames, tidy both of them, and finally join them.
# Retrieving the Bejaia data set
bejaja_forest <- slice(algerian_forest_fires, 2:122)
# Fixing the column types of the numerical variables
bejaja_forest_data <- bejaja_forest %>%
  select(day:FWI) %>%
  map_df(as.double)
# Selecting the observation class data
bejaja_forest_fires <- bejaja_forest %>%
  select(Classes)
# Joining numerical variable and observation class data
bejaja_forest_tidy <- cbind(bejaja_forest_data, bejaja_forest_fires)
# head(bejaja_forest_tidy)
# Finding the number of rows in the original data set
number_of_rows <- nrow(algerian_forest_fires)
# Retrieving the Sidi Bel-abbes data set
sidi_bel_forest <- slice(algerian_forest_fires, 125:number_of_rows)
# Fixing the column types of the numerical variables
sidi_bel_forest_data <- sidi_bel_forest %>%
  select(day:FWI) %>%
  map_df(as.double)
# Selecting the observation class data
sidi_bel_forest_fires <- sidi_bel_forest %>%
  select(Classes)
# Joining numerical variable and observation class data
sidi_bel_forest_tidy <- cbind(sidi_bel_forest_data, sidi_bel_forest_fires)
algerian_forest_fires_tidy <- full_join(bejaja_forest_tidy, sidi_bel_forest_tidy)
# algerian_forest_fires_tidy
# The resulting data frame is tidy and ready for further analysis
# Setting the seed
set.seed(2021)
# Splitting the data set
algerian_forest_fires_tidy <- mutate(algerian_forest_fires_tidy, Classes = as.factor(Classes))
forest_split <- initial_split(algerian_forest_fires_tidy, prop = 0.75, strata = Classes)
forest_train <- training(forest_split)
forest_test <- testing(forest_split)
# Specifying the summary function
my_skim <- skim_with(numeric = sfl(median, mean, sd, min, max),
                     append = FALSE)
# Creating summary table
summary_df <- my_skim(forest_train) %>%
  tibble::as_tibble() %>%
  select(skim_variable:numeric.max)
summary_df
A quick look at the summary table shows that there are more fire observations than non-fire observations.
We believe this is not an issue, since the roughly 30% difference in the number of observations does not warrant upsampling the data.
On the contrary, upsampling in this case would probably hurt performance: each non-fire observation
would gain more weight, making our model favour non-fire predictions.
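The class-balance claim above is easy to check directly. Below is a minimal base-R sketch using a toy label vector with hypothetical counts; in the actual analysis, the same calls would be applied to `forest_train$Classes`.

```r
# Toy label vector standing in for forest_train$Classes (hypothetical counts)
classes <- c(rep("fire", 100), rep("not fire", 70))

# Absolute counts per class
counts <- table(classes)

# Relative frequencies per class
proportions <- prop.table(counts)

counts
round(proportions, 2)
```

A moderate imbalance like this (roughly 60/40) is usually acceptable for k-nn classification without resampling.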
# Plot options
options(repr.plot.width = 18, repr.plot.height= 18)
# Removing the date data
forest_train <- select(forest_train,Temperature:Classes)
# We are using ggpairs to plot our variables against each other and see correlations
ggpairs(forest_train, columns = c("Temperature", "RH", "Ws", "Rain", "FFMC", "DMC", "DC", "ISI", "BUI", "FWI"),
        aes(colour = Classes)) +
  theme(text = element_text(size = 18),
        axis.text = element_text(size = 10))
# Setting the seed value
set.seed(2021)
# Creating a model specification for tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
# Creating a data pre-processing recipe
ff_recipe <- recipe(Classes ~ ISI + DMC, data = forest_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())
# Creating a cross-validation set
ff_vfold <- vfold_cv(forest_train, v = 10, strata = Classes)
gridvals <- tibble(neighbors = seq(1, 15))
# Fitting our model with different k values
knn_results <- workflow() %>%
  add_recipe(ff_recipe) %>%
  add_model(knn_spec) %>%
  tune_grid(resamples = ff_vfold, grid = gridvals) %>%
  collect_metrics()
# Selecting the model with the highest accuracy
best_k <- knn_results %>%
  filter(.metric == "accuracy") %>%
  arrange(desc(mean)) %>%
  slice(1) %>%
  select(neighbors) %>%
  pull()
best_k
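The "pick the k with the highest mean accuracy" step above can be illustrated with plain base R on a toy results table (all numbers below are hypothetical, standing in for the `collect_metrics()` output):

```r
# Toy stand-in for the collect_metrics() output: one mean accuracy per k
toy_results <- data.frame(
  neighbors = 1:5,
  metric    = "accuracy",
  mean      = c(0.80, 0.85, 0.88, 0.86, 0.84)
)

# Pick the k whose mean accuracy is largest -- the same logic as the pipe above
best_k_toy <- toy_results$neighbors[which.max(toy_results$mean)]
best_k_toy  # 3
```

Note that tidymodels also provides `tune::select_best()`, which performs this selection directly on the raw `tune_grid()` result.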
# Setting seed value
set.seed(2021)
# Creating the final model
ff_knn <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) %>%
  set_engine("kknn") %>%
  set_mode("classification")
# Fitting the model
ff_fit <- workflow() %>%
  add_recipe(ff_recipe) %>%
  add_model(ff_knn) %>%
  fit(data = forest_train)
# Using our fit to predict on the test set
ff_predictions <- predict(ff_fit, forest_test) %>%
  bind_cols(forest_test)
# Estimating the accuracy of our classifier
ff_metrics <- ff_predictions %>%
  metrics(truth = Classes, estimate = .pred_class)
ff_metrics
ff_conf_mat <- ff_predictions %>%
  conf_mat(truth = Classes, estimate = .pred_class)
ff_conf_mat
# Creating a vector with 100 evenly spaced values across the ISI range
isi_seq <- seq(from = min(algerian_forest_fires_tidy$ISI, na.rm = TRUE),
               to = max(algerian_forest_fires_tidy$ISI, na.rm = TRUE),
               length.out = 100)
# Creating a vector with 100 evenly spaced values across the DMC range
dmc_seq <- seq(from = min(algerian_forest_fires_tidy$DMC, na.rm = TRUE),
               to = max(algerian_forest_fires_tidy$DMC, na.rm = TRUE),
               length.out = 100)
# Creating a grid of all ISI-DMC combinations
grid_points <- expand.grid(ISI = isi_seq,
                           DMC = dmc_seq)
# Creating a data set with 10,000 observations to construct a decision boundary plot
grid_predicted <- ff_fit %>%
  predict(grid_points) %>%
  bind_cols(grid_points)
# Creating a decision boundary plot
dmc_isi_boundary_plot <- ggplot(grid_predicted, aes(x = DMC, y = ISI, color = .pred_class)) +
  geom_point(alpha = 0.3, size = 5) +
  geom_point(data = algerian_forest_fires_tidy, aes(x = DMC, y = ISI, color = Classes)) +
  labs(y = "Initial Spread Index", x = "Duff Moisture Code", color = "Class label") +
  theme(text = element_text(size = 18),
        axis.text = element_text(size = 10))
dmc_isi_boundary_plot